30 research outputs found
Computation of moments for probabilistic finite-state automata
The computation of moments of probabilistic finite-state automata (PFA) is studied in this article. First, the computation of moments of the length of the paths is introduced for general PFA, and then the computation of moments of the number of times that a symbol appears in the strings generated by the PFA is described. These computations require a matrix inversion. Acyclic PFA, such as word graphs, are quite common in many practical applications. Algorithms for the efficient computation of the moments for acyclic PFA are also presented in this paper.
This work has been partially supported by the Ministerio de Ciencia y Tecnologia under the grant TIN2017-91452-EXP (IBEM), by the Generalitat Valenciana under the grant PROMETE0/2019/121 (DeepPattern), and by the grant "Ayudas Fundacion BBVA a equipos de investigacion cientifica 2018" (PR[8]_HUM_C2_0087).
Sánchez Peiró, JA.; Romero, V. (2020). Computation of moments for probabilistic finite-state automata. Information Sciences. 516:388-400. https://doi.org/10.1016/j.ins.2019.12.052
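As a rough illustration of why a matrix inversion appears in the first abstract, the sketch below computes only the first moment (the expected path length) of a small PFA by solving the linear system (Id - M) v = 1, where M[i][j] is the total probability of moving from state i to state j. This is a minimal sketch under stated assumptions, not the algorithm from the article: the function names `solve` and `expected_path_length`, the matrix layout, and the toy automaton are all hypothetical, and the PFA is assumed to be consistent (every path eventually stops).

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the PFA may not be consistent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_path_length(initial, trans):
    """Expected number of transitions of a path drawn from the PFA.

    initial[i]  : probability of starting in state i (assumed layout)
    trans[i][j] : probability of moving from state i to state j, summed
                  over symbols; 1 - sum(trans[i]) is the stop probability.
    """
    n = len(initial)
    id_minus_m = [[(1.0 if i == j else 0.0) - trans[i][j] for j in range(n)]
                  for i in range(n)]
    # visits[i] = expected number of states visited when starting in state i.
    visits = solve(id_minus_m, [1.0] * n)
    # A path that visits k states contains k - 1 transitions.
    return sum(p * v for p, v in zip(initial, visits)) - 1.0


# Hypothetical two-state PFA: state 0 loops (0.25), moves to state 1 (0.5)
# or stops (0.25); state 1 always stops.
toy_initial = [1.0, 0.0]
toy_trans = [[0.25, 0.5], [0.0, 0.0]]
toy_length = expected_path_length(toy_initial, toy_trans)
```

Solving the linear system is equivalent to applying the inverse of (Id - M) to a vector of ones. For an acyclic PFA, the states can be ordered so that M is triangular, and the same quantity can be obtained by a single backward pass without a general inversion.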
Handwritten Text Recognition for Bengali
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Handwritten text recognition of Bengali is a difficult task because of complex character shapes due to the presence of modified/compound characters as well as zone-wise writing styles of different individuals. Most of the research published so far on Bengali handwriting recognition deals with either isolated character recognition or isolated word recognition, and just a few papers have addressed the recognition of continuous handwritten Bengali. In this paper we present a study on continuous handwritten Bengali. We follow a classical line-based recognition approach with a system based on hidden Markov models and n-gram language models. These models are trained with automatic methods from annotated data. We study both the maximum likelihood approach and the minimum phone error approach for training the optical models. We also study the use of word-based language models and character-based language models. The latter approach allows us to deal with the out-of-vocabulary word problem at test time when the training set is of limited size. The experiments yielded encouraging results.
This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943) and partially supported by MINECO/FEDER, UE under project TIN2015-70924-C2-1-R.
Sánchez Peiró, JA.; Pal, U. (2016). Handwritten Text Recognition for Bengali. IEEE. https://doi.org/10.1109/ICFHR.2016.010
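The out-of-vocabulary argument in the Bengali abstract can be illustrated with a toy comparison between a word-level model and a character-level model. This is only a sketch under assumptions: the add-one smoothed character bigram model, the function names, and the romanized toy vocabulary are hypothetical and are not taken from the paper, which uses n-gram language models trained on its own data.

```python
import math
from collections import Counter


def train_char_bigrams(words):
    """Count character bigrams, using ^ and $ as word-boundary markers."""
    bigrams, contexts, alphabet = Counter(), Counter(), {"$"}
    for w in words:
        chars = ["^"] + list(w) + ["$"]
        alphabet.update(w)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, alphabet


def char_log_prob(word, model):
    """Log-probability of a word under the add-one smoothed bigram model."""
    bigrams, contexts, alphabet = model
    chars = ["^"] + list(word) + ["$"]
    v = len(alphabet)  # number of symbols that can follow a context
    return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
               for a, b in zip(chars, chars[1:]))


def word_unigram_prob(word, words):
    """Unsmoothed word-level probability: zero for any unseen word."""
    return Counter(words)[word] / len(words)


# Hypothetical training vocabulary and an unseen word built from seen characters.
train_words = ["bangla", "bengal", "banga"]
oov_word = "bangal"
word_level = word_unigram_prob(oov_word, train_words)
char_level = char_log_prob(oov_word, train_char_bigrams(train_words))
```

The unsmoothed word-level model assigns the unseen word zero probability, so a recognizer restricted to it could never output that word, while the character-level model still gives it a finite score.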
IMEGE: Image-based Mathematical Expression Global Error
Mathematical expression recognition is an active research field that is related to document image analysis and typesetting. Several approaches have been proposed to tackle this problem, and automatic methods for performance evaluation are required. Mathematical expressions are usually represented as a coded string, such as LaTeX or MathML, for evaluation purposes. This representation has ambiguity problems, given that the same expression can be coded in several ways. For that reason, past approaches either analyzed recognition results manually or reported partial errors such as the symbol error rate. In this study, we present a novel global performance evaluation measure for mathematical expressions based on image matching. Using an image representation resolves the representation ambiguity in the same way that human beings do. The proposed evaluation method is a global error measure that also provides local information about the recognition result.
Álvaro Muñoz, F.; Sánchez Peiró, JA.; Benedí Ruiz, JM. (2011). IMEGE: Image-based Mathematical Expression Global Error. http://hdl.handle.net/10251/1308
An integrated grammar-based approach for mathematical expression recognition
This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition 51 (2016) 135–147. DOI 10.1016/j.patcog.2015.09.013.
Automatic recognition of mathematical expressions is a challenging pattern recognition problem since there are many ambiguities at different levels. On the one hand, the symbols of the mathematical expression must be recognized. On the other hand, the two-dimensional structure that relates the symbols and represents the math expression must be detected. These problems are closely related, since symbol recognition is influenced by the structure of the expression, while the structure strongly depends on the symbols that are recognized. For these reasons, we present an integrated approach that combines several stochastic sources of information and is able to globally determine the most likely expression. This way, symbol segmentation, symbol recognition and structural analysis are simultaneously optimized. In this paper we define the statistical framework of a model based on two-dimensional grammars and its associated parsing algorithm. Since the search space is too large, restrictions are introduced to make the search feasible. We have developed a system that implements this approach and we report results on the large public dataset of the CROHME international competition. This approach significantly outperforms other proposals and was awarded best system using only the training dataset of the competition. (C) 2015 Elsevier Ltd. All rights reserved.
This work was partially supported by the Spanish MINECO under the STraDA research project (TIN2012-37475-C02-01) and the FPU Grant (AP2009-4363).
Álvaro Muñoz, F.; Sánchez Peiró, JA.; Benedí Ruiz, JM. (2016). An integrated grammar-based approach for mathematical expression recognition. Pattern Recognition. 51:135-147. https://doi.org/10.1016/j.patcog.2015.09.013
ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset
This paper describes the Handwritten Text Recognition (HTR) competition on the READ dataset that was held in the context of the International Conference on Frontiers in Handwriting Recognition 2016. This competition aims to bring together researchers working on off-line HTR and provide them with a suitable benchmark to compare their techniques on the task of transcribing typical historical handwritten documents. Two tracks with different conditions on the use of training data were proposed. Ten research groups registered for the competition, but only five submitted results. The handwritten images for this competition were drawn from the German Ratsprotokolle document collection, composed of minutes of the council meetings held from 1470 to 1805 and used in the READ project. The selected dataset is written by several hands and entails significant variabilities and difficulties. The five participants achieved good results, with transcription word error rates ranging from 21% to 47% and character error rates ranging from 5% to 19%.
This work has been partially supported through the European Union's H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), and the MINECO/FEDER UE project TIN2015-70924-C2-1-R.
Sánchez Peiró, JA.; Romero Gómez, V.; Toselli, AH.; Vidal, E. (2016). ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset. IEEE. https://doi.org/10.1109/ICFHR.2016.0120
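Word and character error rates such as those reported for this competition are conventionally derived from the edit distance between a reference transcript and a recognizer hypothesis. The sketch below shows that conventional definition. It is an assumption-laden illustration: the abstract does not specify the competition's scoring protocol (tokenization, case or punctuation handling), and the function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit operations per reference unit, as a percentage."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref_text, hyp_text):
    """Word error rate, assuming whitespace tokenization."""
    return error_rate(ref_text.split(), hyp_text.split())


def cer(ref_text, hyp_text):
    """Character error rate, counting spaces as characters."""
    return error_rate(list(ref_text), list(hyp_text))


# Hypothetical reference and hypothesis lines.
reference = "the council met today"
hypothesis = "the council meet today"
sample_wer = wer(reference, hypothesis)
sample_cer = cer(reference, hypothesis)
```

Because a single wrong character makes the whole word wrong, the word error rate of a system is typically several times its character error rate, which is consistent with the two ranges quoted in the abstract.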
Using the MGGI Methodology for Category-based Language Modeling in Handwritten Marriage Licenses Books
Handwritten marriage license books have been used for centuries by ecclesiastical and secular institutions to register marriages. The information contained in these historical documents is useful for demographic studies and genealogical research, among others. Despite the generally simple structure of the text in these documents, automatic transcription and semantic information extraction are difficult due to the distinct and evolving vocabulary, which is composed mainly of proper names that change over time. In previous works we studied the use of category-based language models both to improve the automatic transcription accuracy and to ease the extraction of semantic information. Here we analyze the main causes of the semantic errors observed in previous results and apply a Grammatical Inference technique known as MGGI to improve the semantic accuracy of the language model obtained. Using this language model, full handwritten text recognition experiments have been carried out, with results supporting the interest of the proposed approach.
This work has been partially supported through the European Union’s H2020 grant READ (Ref: 674943), the European project ERC-2010-AdG-20100407-269796, the MINECO/FEDER, UE projects TIN2015-70924-C2-1-R and TIN2015-70924-C2-2-R, and the Ramon y Cajal Fellowship RYC-2014-16831.
Romero Gómez, V.; Fornes, A.; Vidal Ruiz, E.; Sánchez Peiró, JA. (2016). Using the MGGI Methodology for Category-based Language Modeling in Handwritten Marriage Licenses Books. IEEE. https://doi.org/10.1109/ICFHR.2016.0069
Marco para parsing predictivo interactivo aplicado a la lengua castellana (A framework for interactive predictive parsing applied to the Spanish language)
The Interactive Predictive Parsing (IPP) framework allows the construction of interactive tree annotation systems. These systems can help human annotators create error-free parse trees with little effort, compared to manually post-editing the trees obtained from a completely automatic parser. In this paper we adapt the IPP framework and the IPP-Ann annotation tool to parsing the Spanish language, using models obtained from the UAM Spanish Treebank. We carried out experiments with a simulated user to obtain objective evaluation metrics. The results show that the IPP framework applied to the UAM Spanish Treebank yields a substantial reduction in user effort, comparable to the gains obtained when applying IPP to the English language on the Penn Treebank.
Work supported by the EC (FEDER, FSE), the Spanish Government and Generalitat Valenciana (MICINN, "Plan E", under grants MIPRCV "Consolider Ingenio 2010" CSD2007-00018, MITTRAL TIN2009-14633-C03-01, ALMPR Prometeo/2009/014 and FPU AP2006-01363).
A Set of Benchmarks for Handwritten Text Recognition on Historical Documents
Handwritten Text Recognition is an important requirement in order to make visible the contents of the myriads of historical documents residing in public and private archives and libraries worldwide. Automatic Handwritten Text Recognition (HTR) is a challenging problem that requires a careful combination of several advanced Pattern Recognition techniques, including but not limited to Image Processing, Document Image Analysis, Feature Extraction, Neural Network approaches and Language Modeling. The progress of this kind of system is strongly bound by the availability of adequate benchmarking datasets, software tools and reproducible results achieved using the corresponding tools and datasets. Based on English and German historical documents proposed in recent open competitions at ICDAR and ICFHR conferences between 2014 and 2017, this paper introduces four HTR benchmarks in order of increasing complexity from several points of view. For each benchmark, a specific system is proposed which outperforms the results published so far under comparable conditions. Therefore, this paper establishes new state-of-the-art baseline systems and results, which are intended as new challenges that will hopefully drive further improvement of HTR technologies. Both the datasets and the software tools used to implement the baseline systems are made freely accessible for research purposes. (C) 2019 Elsevier Ltd. All rights reserved.
This work has been partially supported through the European Union's H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), as well as by the BBVA Foundation through the 2017-2018 and 2018-2019 Digital Humanities research grants "Carabela" and "HisClima - Dos Siglos de Datos Climaticos", and by EU JPICH project "HOME - History Of Medieval Europe" (Spanish PEICTI Ref. PC12018-093122).
Sánchez Peiró, JA.; Romero, V.; Toselli, AH.; Villegas, M.; Vidal, E. (2019). A Set of Benchmarks for Handwritten Text Recognition on Historical Documents. Pattern Recognition. 94:122-134. https://doi.org/10.1016/j.patcog.2019.05.025
ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS)
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
A contest on Handwritten Text Recognition organised in the context of the ICFHR 2014 conference is described. Two tracks with increasing freedom in the use of training data were proposed, and three research groups participated in these two tracks. The handwritten images for this contest were drawn from an English data set which is currently being considered in the tranScriptorium project. The goal of this project is to develop innovative, efficient and cost-effective solutions for the transcription of historical handwritten document images, focusing on four languages: English, Spanish, German and Dutch. For the English language, the so-called “Bentham collection” is being considered in tranScriptorium. It encompasses a large set of manuscripts written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). A small subset of this collection has been chosen for the present HTR competition. The selected subset was written by several hands (Bentham himself and his secretaries) and entails significant variabilities and difficulties regarding the quality of text images and writing styles. Training and test data were provided in the form of carefully segmented line images, along with the corresponding transcripts. The three participants achieved very good results, with transcription word error rates ranging from 15.0% down to 8.6%.
The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 600707 - tranScriptorium. The authors would like to thank all the tranScriptorium members for their collaboration and the entrants for their participation in this contest.
Sánchez Peiró, JA.; Romero Gómez, V.; Toselli, AH.; Vidal Ruiz, E. (2014). ICFHR2014 Competition on Handwritten Text Recognition on tranScriptorium Datasets (HTRtS). IEEE. https://doi.org/10.1109/ICFHR.2014.137
Multimodal Interactive Parsing
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-38628-2_57
Probabilistic parsing is a fundamental problem in Computational Linguistics, whose goal is to obtain a syntactic structure associated with a sentence according to a probabilistic grammatical model. Recently, an interactive framework for probabilistic parsing has been introduced, in which the user and the system cooperate to generate error-free parse trees. In an early prototype developed according to this interactive parsing technology, user feedback was provided by means of mouse actions and keyboard strokes. Here we augment the interaction style with support for (non-deterministic) natural handwriting recognition, and provide confidence measures as a visual aid to ease the correction process. Handwriting input seems to be a modality especially suitable for parsing, since the vocabulary size involved in the recognition of syntactic labels is fairly limited and thus, intuitively, errors should be few. However, errors may increase as handwriting quality (i.e., calligraphy) degrades. To solve this problem, we introduce a late fusion approach that leverages both on-line and off-line information, corresponding to pen strokes and contextual information from the parse trees. We demonstrate that late fusion can effectively help to disambiguate user intention and improve system accuracy.
This research has received funding from the EC’s 7th Framework Programme (FP7/2007-13) under grant agreement No. 287576-CasMaCat; from the Spanish MEC under the STraDA project (TIN2012-37475-C02-01) and the MITTRAL project (TIN2009-14633-C03-01); from the GV under the Prometeo project; and from the Universidad del Cauca (Colombia).
Benedí Ruiz, JM.; Sánchez Peiró, JA.; Leiva, LA.; Sánchez Sáez, R.; Maca, M. (2013). Multimodal Interactive Parsing. In Pattern Recognition and Image Analysis. Springer. 484-491. https://doi.org/10.1007/978-3-642-38628-2_57